Faster learning of control parameters through sharing experiences of autonomous mobile robots

Authors

  • Ian Darrell Kelly
  • David A. Keating
Abstract

This paper describes a learning algorithm for small autonomous mobile robots based on sets of fuzzy automata. The task of the robots is to learn how to avoid obstacles reactively. In the approach presented here, two robots learn simultaneously, with the experiences of each robot being passed on to the other. It is shown that this sharing of experiences results in faster and more repeatable learning of the robots' behavioural parameters than individual learning.

Introduction

Recently there has been much interest in the development of intelligent machines which learn from their past experiences [Mataric, 1996], [Mitchell, 1994], [Schaerf, 1995], [Yamaguchi, 1996]. In the field of robotics, much attention has recently been devoted to bottom-up robot control and sensor development [Asama, 1996], [Brooks, 1991], [Kelly, 1996], [Langton, 1988], [Levy, 1992], [Sugawara, 1996]. In this strategy, known as Artificial Life, robots are built starting with simple sensors and hence simpler (although not trivial) control strategies. More complex sensors and behaviour patterns can then be built on top of the more basic behaviours [Brooks, 1986].

Our first group of robots, known as the Seven Dwarfs, do not even have a microprocessor. Instead, their controller utilises a look-up table stored in Electrically Erasable Programmable Read-Only Memory (EEPROM). In conjunction with simple ultrasonic sensors, this allows the Seven Dwarfs to wander around an environment successfully at a speed of 1 m/s, avoiding both stationary and moving obstacles. More recently we have developed a robot that learns how to avoid obstacles reactively in its environment [Mitchell, 1994]. This learning robot also utilises simple ultrasonic sensors and a low-power microprocessor (a 4 MHz Z80), and learns how to avoid obstacles successfully in only a few minutes.

With the development of communicating robots [Kelly, 1996] we have started studying two-way shared-experience learning, in which each robot shares its experiences with the other robots, decreasing the overall time taken for the robots to learn how to avoid obstacles. If the communication of data between robots were perfect, a pair of learning robots would effectively be one learning agent with twice the learning data rate and the possibility of being in two different situations at the same time. The overall learning time should therefore be inversely proportional to the number of robots learning together.

The Seven Dwarfs

Each of the Seven Dwarfs (as shown in figure 1) has two pairs of ultrasonic sonar transducers; each pair has a useful range of about 15 to 500 mm. Together they produce four bits of information: one bit gives the direction of the nearest obstacle (left or right), and the remaining three bits give the range to the obstacle. Four mode-select switches produce another four bits of information, allowing each robot to have 16 different control algorithms. The address formed by the data from the ultrasonics and the mode-select switches is used to address an EEPROM which forms a look-up table of the required motor actions for each possible obstacle position. The EEPROM produces eight bits of data; each motor driver requires four bits, one for direction and three for speed control. Each motor driver produces a pulse-width-modulated signal which controls the motor speed via an H-bridge of MOSFETs.
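For concreteness, the sketch below shows how one control tick of such a table-driven controller might look in C. This is an illustration, not the authors' design: the table name, bit layout and encodings are assumptions; only the overall scheme (an 8-bit address formed from four sonar bits and four mode-select bits, returning one 4-bit command per motor) follows the description above.

```c
#include <stdint.h>

/* Illustrative sketch of one control tick of an EEPROM look-up-table
 * controller.  Bit layout and encodings are assumed, not the authors'. */

static const uint8_t eeprom[256] = { 0 };  /* table of motor actions,
                                              programmed per behaviour */
typedef struct {
    uint8_t direction;   /* 1 bit: assumed 0 = forward, 1 = reverse */
    uint8_t speed;       /* 3 bits: 0..7, fed to the PWM generator  */
} motor_cmd;

void control_tick(uint8_t obstacle_left,   /* 1 bit: side of obstacle */
                  uint8_t range_code,      /* 3 bits: range to it     */
                  uint8_t mode,            /* 4 bits: mode switches   */
                  motor_cmd *left, motor_cmd *right)
{
    /* Sonar and mode bits together form the 8-bit EEPROM address. */
    uint8_t address = (uint8_t)((mode << 4) |
                                ((obstacle_left & 1u) << 3) |
                                (range_code & 7u));
    uint8_t actions = eeprom[address];     /* 8 bits: both commands   */

    left->direction  = (actions >> 7) & 1u;  /* high nibble: left motor */
    left->speed      = (actions >> 4) & 7u;
    right->direction = (actions >> 3) & 1u;  /* low nibble: right motor */
    right->speed     =  actions       & 7u;
}
```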
The sonar sensors and look-up table are read forty times a second, allowing the robots to travel easily at 1 m/s. All feedback is derived from the ultrasonic sonar system, as the motor drivers are open-loop. The aim of the robot's controller is to stay further away from all stationary and moving obstacles than a pre-defined minimum distance, and to go forward when no objects are within range. An example controller would drive both motors forward when no objects are within range and turn away from objects that are closer than this minimum threshold. Turning away from objects can be accomplished using many different strategies; for example, to make the robot turn right, any of the following could be used: driving the left wheel forward whilst stopping or reversing the right wheel, or stopping the left wheel whilst reversing the right wheel. Other actions can also be performed with different controllers on the Seven Dwarfs, including following moving objects or following the boundary of a room.

The robots

Five autonomous mobile robots have been constructed which are equipped with an infrared communication system as well as an improved ultrasonic sonar for detecting obstacles. The obstacle-detection system consists of three sets of ultrasonic sonar transducers: one set looking forward, one to the front-left and one to the front-right. Unlike the Seven Dwarfs, this sonar system returns the range to the nearest obstacle in front of each set of transducers. To allow the robots to detect close obstacles, echoes have to be detected whilst the ultrasonic pulse is still being transmitted, which requires a high detection threshold. For the detection of objects further away, a much lower threshold is required to allow for the large signal loss. To detect both near and distant obstacles, a varying-threshold system is therefore used (sketched below): the threshold is initially large but decreases with time to a pre-set minimum, and a time-out determines that no objects are within range. Each set of ultrasonic transducers is scanned ten times per second and has a range of 30 mm to 1 m with a resolution of better than 5 mm.

An active infrared light system is used for inter-robot communications. This is frequency-division multiplexed, with each robot having its own channel. The carriers of these channels range from 220 kHz to 400 kHz, and the transmission medium is 950 nm infrared light. In order to obtain more transmission power, and hence range, from inexpensive light-emitting diodes (LEDs) at these frequencies, a ring of twelve 60-degree LEDs is used. Each robot also has four photodiodes arranged 90 degrees apart (each with a half-power angle of 120 degrees). This combination of LEDs and photodiodes allows transmissions to be received regardless of the robots' angles of rotation. Information is transmitted by frequency-modulating the carriers, with decoding carried out by an off-the-shelf radio-frequency (RF) integrated circuit. Data is transceived at 1200 baud (120 bytes per second) using differential phase-shift keying (DPSK), thus permitting automatic frequency control. The range of the communications system is over 5 metres, with each LED driven at 50 mA. The transmitter therefore requires a maximum current of about 200 mA at 5 V, whilst the receiver requires less than 20 mA, again at 5 V. Physically the robots are small, with a width of 140 mm, a length of 130 mm and a height of 140 mm (see figure 2).
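The varying-threshold detector mentioned above lends itself to a very small loop. The sketch below is a guess at its shape, assuming a software-polled echo amplitude; the constants, sample rate and the adc_read_echo routine are invented purely for illustration.

```c
#include <stdint.h>

#define THRESH_START 240u   /* high threshold while transmitting (assumed) */
#define THRESH_MIN    25u   /* pre-set minimum for weak, distant echoes    */
#define THRESH_DECAY   2u   /* threshold drop per sample (assumed)         */
#define NO_ECHO     0xFFFFu /* time-out marker: nothing within range       */

extern uint8_t adc_read_echo(void);  /* hypothetical echo-amplitude read */

/* Returns the echo delay in samples (proportional to range), or NO_ECHO
 * if the time-out expires before any echo crosses the threshold. */
uint16_t sonar_ping(uint16_t timeout_samples)
{
    uint16_t threshold = THRESH_START;

    for (uint16_t t = 0; t < timeout_samples; t++) {
        if (adc_read_echo() >= threshold)
            return t;                     /* echo detected             */
        if (threshold > THRESH_MIN + THRESH_DECAY)
            threshold -= THRESH_DECAY;    /* decay towards the minimum */
        else
            threshold = THRESH_MIN;
    }
    return NO_ECHO;
}
```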
Each robot weighs less than 600 grams. Movement is provided by two small d.c. motors with in-line gearboxes connected to the back wheels; the front of each robot is supported by a single castoring wheel. At present, motor control is provided by open-loop pulse-width-modulated controllers providing several different speeds (up to 1 m/s) and directional control. Each robot is controlled by a single 8 MHz Z80 CPU, which receives an interrupt whenever the receiver detects a byte of data from another robot. Power is provided by four 1.2 V, 1800 mAh nickel-metal-hydride rechargeable batteries. To keep the centre of gravity of each robot low, the batteries are placed underneath the robots. When fully charged the batteries have a maximum terminal voltage of 5.7 V, but throughout most of their discharge cycle the terminal voltage is 4.8 V. Since all the electronics require just a single +5 V supply, a low-dropout voltage regulator is used. The motors have their own simple step-up regulator to provide them with a constant 6 V. With maximal activity, in the worst case, the battery life is greater than five hours from a full charge.

The learning algorithm

The learning algorithm is based on sets of fuzzy automata, as described by Narendra and Thathachar (1989). It is similar to one-step Q-learning [Watkins & Dayan, 1992] except that the learning function operates directly on the "best" actions rather than on the expected reward. The robot is required to learn a mapping from its input states to its output states which allows it to move around its environment whilst avoiding obstacles. With robots that have two bi-directional motors there are nine different possible actions (a1..a9), if the only motor speeds are forward at full speed, backward at full speed and stop. We have defined five different circumstances in which the robot can find itself:

case 1 : no object near the robot
case 2 : obstacle in the distance to the right
case 3 : obstacle in the distance to the left
case 4 : obstacle relatively near to the right
case 5 : obstacle relatively near to the left

It should be stressed that this learning strategy does not try to build a map of the environment; instead, the robot learns how to react to obstacles in each of the different circumstances. Each circumstance has its own fuzzy automaton associated with it. Each automaton, which is effectively a set of motor actions (a1..a9), has a set of probabilities (p1..p9) of taking the associated action. At the start of the learning algorithm these probabilities are all set to the same value, so that in any situation every action initially has an equal chance of being selected. A weighted roulette-wheel technique, similar to the selection methods used in genetic algorithms [Goldberg, 1989], is used to randomly select an action for the given input (obstacle position); the action with the highest probability is the most likely to be chosen. The chosen action is executed for a short period of time and is then evaluated. If the action was successful, its probability of being selected is increased; if it was a failure, its probability of being selected is decreased. In both cases the other probabilities of the chosen automaton are adjusted so as to keep the total probability constant.
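A minimal sketch of this selection-and-adjustment cycle is given below, assuming a flat array layout: the paper specifies five automata of nine actions each and (as noted in the next section) signed 16-bit weights rather than fractions, but not the data structures themselves.

```c
#include <stdint.h>
#include <stdlib.h>

#define NUM_CASES   5   /* the five obstacle circumstances            */
#define NUM_ACTIONS 9   /* a1..a9: the nine motor-action combinations */

/* One automaton per circumstance; weights are assumed to stay positive. */
int16_t prob[NUM_CASES][NUM_ACTIONS];

/* Weighted roulette wheel: larger weights occupy larger slices, so the
 * highest-weighted action is the most likely, but not certain, choice. */
int choose_action(int c)
{
    int32_t total = 0;
    for (int a = 0; a < NUM_ACTIONS; a++)
        total += prob[c][a];
    if (total <= 0)
        return 0;                        /* guard: degenerate wheel   */

    int32_t spin = rand() % total;       /* pointer position on wheel */
    for (int a = 0; a < NUM_ACTIONS; a++) {
        spin -= prob[c][a];
        if (spin < 0)
            return a;                    /* pointer landed here       */
    }
    return NUM_ACTIONS - 1;
}

/* Reward or punish the chosen action by `step` (a scaled, non-linear
 * form of Alpha; the exact rule is given in the next section), spreading
 * the opposite change over the other eight actions so the total stays
 * approximately constant. */
void adjust(int c, int chosen, int16_t step, int success)
{
    for (int a = 0; a < NUM_ACTIONS; a++) {
        int16_t d = (a == chosen) ? step : (int16_t)-(step / 8);
        prob[c][a] += success ? d : (int16_t)-d;
    }
}
```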
If the performance value is Alpha (positive meaning successful, negative meaning unsuccessful), n is the action that was chosen and j ranges over all the actions except n, then the rules for adjusting the probabilities are:

    Alpha = e^|Alpha|          (make Alpha non-linear)
    Alpha = Alpha / Factor     (scale Alpha)
    IF the action was successful THEN
        pn = pn + Alpha        (increase probability of chosen action)
        pj = pj - (Alpha / 8)  (decrease other probabilities)
    ELSE
        pn = pn - Alpha        (decrease probability of chosen action)
        pj = pj + (Alpha / 8)  (increase other probabilities)
    END

To emphasise both very good and very bad actions, the magnitude of Alpha is made exponential. It is then divided by one of three different values (Factor = 4, 2 or 1), which allows the effects of different learning rates to be investigated. Since only a slow processor is available, all of the probabilities are stored as signed 16-bit integers rather than as fractions.

In order to evaluate Alpha, a definition of which actions are good and which are bad is required. The rules chosen are general and hence give the robot no information about which motor actions to select. The basic rules are: if there is no object within range it is good to go forward; if an object is relatively near it is good to get further away from it; and if the object is in the distance it is still good to go forward, but it is also good to get further away from the object. The rules used for calculating Alpha are:

case 1 (no object near robot) : Alpha = (LeftSpeed + RightSpeed) * 3
cases 2 and 3 (obstacle in the distance) : Alpha = ((RightSpeed + LeftSpeed) * 2 + FrontDir) / 2
cases 4 and 5 (obstacle relatively near) : Alpha = FrontDir

where LeftSpeed and RightSpeed are the current motor speeds (+1 for forwards, -1 for backwards and 0 for stop), and FrontDir is the change in distance of the nearest object seen by the front sensor. For changes in distance greater than +/-10 mm, FrontDir is positive when the robot is getting further away and negative otherwise, and it is bounded at +/-35 mm (at which point FrontDir is +/-9).

The main adaptive loop of the program is:

1. Choose the probability set (automaton) according to the obstacle position.
2. Choose an action an from this automaton.
3. Move the robot for a short period of time.
4. Evaluate the action (derive the value of Alpha) and update the probabilities (using the non-linear, scaled Alpha).

Only the range returned by the front sonar sensor is used to determine whether the robot has successfully moved further away from an obstacle. Using the left and right sonar sensors can lead the robot into thinking that it is successfully moving away from an obstacle when it is actually moving towards it, as shown in figure 3. This problem occurs from -13 to +13 and is overcome by using the front sonar sensor, as shown by the solid line.

Dual Learning

For bi-directional dual learning, each robot has to transmit the position of the object to which it reacted (i.e. case 1 to 5), the action that it chose (i.e. 0 to 8) and how good or bad that action was (i.e. the value of Alpha) in relation to the object's position and the chosen motor actions. To allow for the relatively low rate of data transmission and the scanning of up to four robots, six sets of this data are compressed into a packet along with start-of-packet/synchronisation codes, an incremental packet ID and a two-byte checksum.
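The packet might be sketched as below. Only the contents listed above (six case/action/Alpha records, synchronisation codes, an incremental ID and a two-byte checksum) come from the paper; the field widths, sync values and checksum scheme are assumptions.

```c
#include <stdint.h>

#define RECORDS_PER_PACKET 6
#define SYNC0 0xAA          /* start-of-packet codes (assumed values) */
#define SYNC1 0x55

typedef struct {
    uint8_t obstacle_case;  /* circumstance reacted to: 1..5  */
    uint8_t action;         /* action chosen: 0..8            */
    int8_t  alpha;          /* how good or bad the action was */
} experience;

typedef struct {
    uint8_t    sync[2];     /* synchronisation codes              */
    uint8_t    id;          /* incremental packet ID              */
    experience rec[RECORDS_PER_PACKET];
    uint16_t   checksum;    /* two-byte checksum over the packet  */
} packet;

/* Receiver side: a gap in the ID sequence reveals missed packets
 * (the 8-bit subtraction handles wrap-around of the counter). */
int packets_missed(uint8_t last_id, uint8_t new_id)
{
    return (uint8_t)(new_id - last_id) - 1;   /* 0 when consecutive */
}
```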
The ID code allows the detection of missed packets: if the ID of a packet is not one greater than that of the last packet, then one or more packets of data have been missed. On average, about 10% of packets are lost. The main adaptive loop of the program now becomes:

1. Choose the probability set (automaton) according to the obstacle position.
2. Choose an action an from this automaton.
3. Move the robot for a short period of time.
4. Evaluate the action (derive the value of Alpha) and update the probabilities (using the non-linear, scaled Alpha).
5. Collect and transmit information to the other robot(s). (Data from the other robots is handled under interrupt; the robot's probabilities are updated whenever a complete packet of data is received.)

Results

The test set-up consisted of two robots, each in its own environment bounded by an elliptical wall of 4.9 m circumference and 120 mm height. Tests were conducted with two robots running the learning algorithm at the same time, both with and without the sharing of experiences. The fitness was calculated by comparing the values in the tables of probabilities with "ideal" values determined by the authors (one possible such comparison is sketched at the end of this discussion); it was evaluated five times a second and the average logged every second. Three tests, each using two robots, were performed under each test condition, thus giving six sets of data from which an average learning rate could be determined.

Graphs of fitness against time are shown in figures 4 to 6 and figure 9. Figure 4 shows the learning with the normal Alpha factor, which had previously been determined to be the best compromise between speed of learning and robustness. The results for the six runs are shown for the single-robot and dual-learning cases, with the average of the six runs shown as a bold line; the final graph shows the two averages for comparison. Figure 5 shows the same graphs but with an Alpha factor of half the normal value, whilst figure 6 shows them with an Alpha factor of double the normal value. In each case shared-experience learning shows an improved learning rate over the individual case. The learning is also more repeatable when each robot has the other robot's experiences to help reduce the effects of noise.

The top graph of figure 7 shows a comparison of the derivative with respect to time of the averaged single- and dual-robot learning with the normal Alpha factor; the middle and bottom graphs show the same comparison with half and double the normal Alpha factor respectively. The graphs of figure 7 therefore show the mean learning rate of each group of robots over time. It can be seen that the initial learning rate of dual learning is significantly better for all of the Alpha factors. The peak values from these graphs are:

Type of learning   Normal Alpha   Half normal Alpha   Twice normal Alpha
Single robot       0.376          0.238               0.634
Dual learning      0.727          0.445               0.861

With both the normal Alpha factor and half the normal Alpha factor, the peak learning rate of dual-robot learning is almost twice that of single-robot learning. In both cases dual learning does not achieve the expected doubling of the learning rate because of the 10% loss of information over the communications system. With twice the normal Alpha factor, dual learning becomes unstable, due to too much gain, and does not achieve the expected increase in the learning rate.
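The paper does not state which distance measure was used between the learned and "ideal" probability tables; purely for illustration, the sketch below uses a sum of absolute differences, so that a larger (less negative) value means the learned tables are closer to ideal.

```c
#include <stdint.h>
#include <stdlib.h>

#define NUM_CASES   5
#define NUM_ACTIONS 9

/* Hypothetical fitness measure: compare every learned weight with the
 * corresponding "ideal" value and accumulate the absolute differences. */
long fitness(const int16_t prob[NUM_CASES][NUM_ACTIONS],
             const int16_t ideal[NUM_CASES][NUM_ACTIONS])
{
    long error = 0;
    for (int c = 0; c < NUM_CASES; c++)
        for (int a = 0; a < NUM_ACTIONS; a++)
            error += labs((long)prob[c][a] - (long)ideal[c][a]);
    return -error;    /* larger = closer to the ideal tables */
}
```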
Figure 8 shows plots of the standard deviations of the six sets of learning results. The top graph compares the standard deviations against time of the single- and dual-robot learning with the normal Alpha factor; the middle and bottom graphs show the same comparison with half and double the normal Alpha factor respectively. The averages of these standard deviations are:

Type of learning   Normal Alpha   Half normal Alpha   Twice normal Alpha
Single robot       4.458          3.692               5.737
Dual learning      3.365          1.907               3.832

In all cases dual learning gives a lower mean standard deviation than single-robot learning, showing that dual shared-experience learning produces more repeatable results.

A comparison between one robot learning with a given Alpha factor and dual learning with each robot having half of that Alpha factor is shown in figure 9. In the first case the single robot has the normal Alpha factor, whilst in the second case the single robot has double the normal Alpha factor. When the single robot has the optimal value of Alpha, dual learning with half the value shows virtually no advantage in speed of learning, but does show greater immunity to noise. When the Alpha factor is increased, dual learning shows not only increased noise immunity but also a slightly faster learning rate.

Conclusions

The results clearly show that shared-experience learning is faster and more repeatable than individual learning. The use of shared-experience learning also allows the advantages of increased Alpha factors to be exploited where, in the individual case, the effects of noise would have prevented this. The authors are currently investigating the effects of mutual learning with more than two robots sharing each other's experiences.

References

Asama, H., Fukuda, T., Arai, T. & Endo, I. (Eds) (1996) "Distributed autonomous robotic systems 2." Springer-Verlag.
Brooks, R.A. (1986) "A robust layered control system for a mobile robot." IEEE Journal of Robotics and Automation, Vol. RA-2, No. 1, pp. 14-23.
Brooks, R.A. (1991) "Artificial life and real robots." In Towards a Practice of Autonomous Systems: Proceedings of the First European Conference on Artificial Life, eds F.J. Varela and P. Bourgine. MIT Press.
Goldberg, D.E. (1989) "Genetic algorithms." Addison-Wesley.
Kelly, I.D. & Keating, D.A. (1996) "Flocking by the fusion of sonar and active infrared sensors on physical autonomous mobile robots." Proc. Mechatronics '96.
Langton, C. (1988) "Artificial life." In Artificial Life, Vol. VI, Santa Fe Institute Studies in the Sciences of Complexity. Addison-Wesley.
Levy, S. (1992) "Artificial life: The quest for a new creation." Penguin Books.
Mataric, M.J. (1996) "Learning in multi-robot systems." In Adaptation and Learning in Multi-Agent Systems, eds Gerhard Weiss and Sandip Sen, Lecture Notes in Artificial Intelligence (LNAI), Vol. 1042. Springer-Verlag.
Mitchell, R.J., Keating, D.A. & Kambhampati, C. (1994) "Neural network controller for mobile robot insect." Proc. EURISCON '94.
Narendra, K. & Thathachar, M. (1989) "Learning automata: An introduction." Prentice-Hall.
Schaerf, A., Shoham, Y. & Tennenholtz, M. (1995) "Adaptive load balancing: A study in multi-agent learning." Journal of Artificial Intelligence Research, 2, May.
Sugawara, K. & Sano, M. (1996) "Co-operative acceleration of task performance: Foraging behaviour of interacting multi-robot system." In Distributed Autonomous Robotic Systems 2, eds Asama, H., Fukuda, T., Arai, T. & Endo, I. Springer-Verlag.
Watkins, C.J.C.H. & Dayan, P. (1992) "Technical note: Q-learning." Machine Learning, 8, pp. 279-292. Kluwer Academic Publishers, Boston.
Yamaguchi, T., Masubuchi, M., Fujihara, K. & Yachida, M. (1996) "Real-time reinforcement learning for a real robot in the real environment." IROS '96, Osaka, Japan.

Figure 1: block diagram of one of the Seven Dwarfs.
Figure 2: four of the learning robots.




Journal title:
  • Int. J. Systems Science

Volume 29, Issue

Pages -

Publication date 1998